2  Technical Infrastructure  

The ETL pipeline will run on a secure HPC cloud hosted at the National Genome Center in Denmark. The cloud has no internet access, although external resources can be brought in if packaged as Singularity image files. The compute node available for this project has a 40-core 2.1GHz CPU, 192GB of RAM, and 1.9TB of warm storage on an NVMe SSD with an XFS filesystem, of which approximately 1.5TB will be available to the project.

All code will be containerised using Singularity to (i) comply with the HPC nature of the cloud, which precludes root access, (ii) prevent software version conflicts, and (iii) facilitate a potential migration to a different cloud system in the future. Singularity containers are brought onto the cloud through a semi-automated process that involves building an image from a Dockerfile (with accompanying metadata such as python_requirements.txt).

The SQL parts of the ETL will use PostgreSQL 16.x. Non-SQL parts will be written in Python 3.10.x or R >= 4.0; should the need arise, other languages (e.g., Rust or Julia) can be used as well. All databases will be deployed as containers, and the filtered source data will be loaded into the dedicated target databases (see schematic). The database files will reside on the SSD storage.
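As a minimal sketch of the loading step, assuming the filtered source data are staged as CSV files and that the Python parts use a standard PostgreSQL driver such as psycopg2 (the driver choice, table name, file name, and connection details below are illustrative assumptions, not fixed design decisions):

    import psycopg2  # assumed PostgreSQL driver; any standard client library would work

    # Illustrative connection parameters for the containerised target database;
    # host, credentials, and schema are placeholders, not part of the design.
    conn = psycopg2.connect(
        host="localhost", port=5432, dbname="target_db", user="etl", password="changeme"
    )

    with conn, conn.cursor() as cur, open("filtered_source.csv") as f:
        # Bulk-load the filtered extract via PostgreSQL's COPY protocol,
        # avoiding per-row INSERT overhead for large source files.
        cur.copy_expert(
            "COPY staging.filtered_source FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )

    conn.close()

Loading via COPY rather than row-wise INSERTs should keep bulk loads bound by the NVMe storage throughput rather than by client-server round-trips.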

Updates are expected to run monthly or quarterly, with an intermittent need for more frequent runs in extreme cases such as pandemics. In those cases, additional compute resources would likely be made available to compensate.